{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2c05da4b",
   "metadata": {
    "id": "2c05da4b"
   },
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/quick-tour/metadata-filtering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/quick-tour/metadata-filtering.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "artificial-devil",
   "metadata": {
    "id": "artificial-devil",
    "papermill": {
     "duration": 0.037678,
     "end_time": "2021-04-16T15:12:08.268491",
     "exception": false,
     "start_time": "2021-04-16T15:12:08.230813",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Metadata filtering with Pinecone\n",
    "\n",
    "Metadata filtering is a new feature in Pinecone that allows you to apply filters on vector search based on metadata.\n",
    "You can add the metadata to the embeddings within Pinecone, and then filter for those criteria when sending the query. Pinecone will search for similar vector embeddings only among those items that match the filter.\n",
    "The metadata filtering accepts arbitrary filters on metadata, and it retrieves exactly the number of nearest-neighbor results that match the filters. For most cases, the search latency will be even lower than unfiltered searches.\n",
    "\n",
    "In this notebook, we will walk through a simple use of filtering while performing vector search on documents."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "atmospheric-wayne",
   "metadata": {
    "id": "atmospheric-wayne",
    "papermill": {
     "duration": 0.028014,
     "end_time": "2021-04-16T15:12:08.327699",
     "exception": false,
     "start_time": "2021-04-16T15:12:08.299685",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Prerequisites"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "emotional-lyric",
   "metadata": {
    "id": "emotional-lyric",
    "papermill": {
     "duration": 0.027173,
     "end_time": "2021-04-16T15:12:08.383073",
     "exception": false,
     "start_time": "2021-04-16T15:12:08.355900",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Install dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "pleasant-transfer",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "pleasant-transfer",
    "outputId": "1a36938b-322a-4606-809b-58b3216d3995",
    "papermill": {
     "duration": 15.880968,
     "end_time": "2021-04-16T15:12:24.293137",
     "exception": false,
     "start_time": "2021-04-16T15:12:08.412169",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install -qU pandas==2.2.3 pinecone==6.0.2"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20ccf1fc",
   "metadata": {},
   "source": [
    "## Creating an Index\n",
    "\n",
    "We begin by instantiating an instance of the Pinecone client. To do this we need a [free API key](https://app.pinecone.io).\n",
    "\n",
    "We begin by instantiating an instance of the Pinecone client. To do this we need a [free API key](https://app.pinecone.io)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "cbbd3189",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "from pinecone import Pinecone\n",
    "\n",
    "# Initialize client\n",
    "api_key = os.environ.get(\"PINECONE_API_KEY\") or \"PINECONE_API_KEY\"\n",
    "pc = Pinecone(api_key=api_key)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd116f29",
   "metadata": {},
   "source": [
    "## Creating a Pinecone Index\n",
    "\n",
    "When creating the index we need to define several configuration properties. \n",
    "\n",
    "- `name` can be anything we like. The name is used as an identifier for the index when performing other operations such as `describe_index`, `delete_index`, and so on. \n",
    "- `metric` specifies the similarity metric that will be used later when you make queries to the index.\n",
    "- `dimension` should correspond to the dimension of the dense vectors produced by your embedding model. In this quick start, we are using made-up data so a small value is simplest.\n",
    "- `spec` holds a specification which tells Pinecone how you would like to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/troubleshooting/available-cloud-regions).\n",
    "\n",
    "There are more configurations available, but this minimal set will get us started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "tvAzaNJZ1M8T",
   "metadata": {
    "id": "tvAzaNJZ1M8T",
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "index_name = \"pinecone-metadata-filtering\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "529ff307-64e7-4827-8110-cb6ff5343d43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Delete the demo index if already exists\n",
    "if pc.has_index(name=index_name):\n",
    "    pc.delete_index(name=index_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "balanced-housing",
   "metadata": {
    "id": "balanced-housing",
    "papermill": {
     "duration": 16.057888,
     "end_time": "2021-04-16T15:12:41.454202",
     "exception": false,
     "start_time": "2021-04-16T15:12:25.396314",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pinecone import ServerlessSpec, Metric, CloudProvider, AwsRegion\n",
    "\n",
    "# Create an index\n",
    "index_config = pc.create_index(\n",
    "    name=index_name,\n",
    "    dimension=2,\n",
    "    metric=Metric.EUCLIDEAN,\n",
    "    spec=ServerlessSpec(cloud=CloudProvider.AWS, region=AwsRegion.US_EAST_1),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36716d3b-d238-4fa0-b353-5740a7b81fac",
   "metadata": {},
   "source": [
    "## Working with the Index\n",
    "\n",
    "Data operations such as `upsert` and `query` are sent directly to the index host instead of `api.pinecone.io`, so we use a different client object object for these operations. By using the `.Index()` helper method to construct this client object, it will automatically inherit your API Key and any other configurations from the parent `Pinecone` instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "blvbpzBAxPJO",
   "metadata": {
    "id": "blvbpzBAxPJO",
    "papermill": {
     "duration": 0.869129,
     "end_time": "2021-04-16T15:12:42.358177",
     "exception": false,
     "start_time": "2021-04-16T15:12:41.489048",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Instantiate an index client\n",
    "index = pc.Index(host=index_config.host)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "improved-season",
   "metadata": {
    "id": "improved-season",
    "papermill": {
     "duration": 0.038949,
     "end_time": "2021-04-16T15:12:42.437637",
     "exception": false,
     "start_time": "2021-04-16T15:12:42.398688",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### Generate sample document data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "disciplinary-district",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 175
    },
    "id": "disciplinary-district",
    "outputId": "132a09be-6e8c-4f8e-baeb-13b6fbb0ef7a",
    "papermill": {
     "duration": 0.24115,
     "end_time": "2021-04-16T15:12:42.715499",
     "exception": false,
     "start_time": "2021-04-16T15:12:42.474349",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>vector</th>\n",
       "      <th>metadata</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>F-1</td>\n",
       "      <td>[1.0, 1.0]</td>\n",
       "      <td>{'category': 'finance', 'published': 2015}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>F-2</td>\n",
       "      <td>[2.0, 2.0]</td>\n",
       "      <td>{'category': 'finance', 'published': 2016}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>S-1</td>\n",
       "      <td>[3.0, 3.0]</td>\n",
       "      <td>{'category': 'sport', 'published': 2017}</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>S-2</td>\n",
       "      <td>[4.0, 4.0]</td>\n",
       "      <td>{'category': 'sport', 'published': 2018}</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    id      vector                                    metadata\n",
       "0  F-1  [1.0, 1.0]  {'category': 'finance', 'published': 2015}\n",
       "1  F-2  [2.0, 2.0]  {'category': 'finance', 'published': 2016}\n",
       "2  S-1  [3.0, 3.0]    {'category': 'sport', 'published': 2017}\n",
       "3  S-2  [4.0, 4.0]    {'category': 'sport', 'published': 2018}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Generate some data\n",
    "import pandas as pd\n",
    "\n",
    "df = pd.DataFrame()\n",
    "df[\"id\"] = [\"F-1\", \"F-2\", \"S-1\", \"S-2\"]\n",
    "df[\"vector\"] = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]\n",
    "df[\"metadata\"] = [\n",
    "    {\"category\": \"finance\", \"published\": 2015},\n",
    "    {\"category\": \"finance\", \"published\": 2016},\n",
    "    {\"category\": \"sport\", \"published\": 2017},\n",
    "    {\"category\": \"sport\", \"published\": 2018},\n",
    "]\n",
    "df"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "leading-flesh",
   "metadata": {
    "id": "leading-flesh",
    "papermill": {
     "duration": 0.030901,
     "end_time": "2021-04-16T15:12:42.777653",
     "exception": false,
     "start_time": "2021-04-16T15:12:42.746752",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### Insert vectors\n",
    "\n",
    "Most operations accept an optional param called `namespace`. When this parameter is not specified, the operation assumes you wish to use the default namespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "nearby-skiing",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "nearby-skiing",
    "outputId": "ce5118ac-d926-4847-cdb5-744d2bb49049",
    "papermill": {
     "duration": 1.65623,
     "end_time": "2021-04-16T15:12:44.464926",
     "exception": false,
     "start_time": "2021-04-16T15:12:42.808696",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'upserted_count': 4}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Insert vectors without specifying a namespace\n",
    "index.upsert(vectors=zip(df.id, df.vector, df.metadata))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "46e20dc2-2dc8-4836-88f2-c16fe125a0ae",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'dimension': 2,\n",
       " 'index_fullness': 0.0,\n",
       " 'metric': 'euclidean',\n",
       " 'namespaces': {'': {'vector_count': 4}},\n",
       " 'total_vector_count': 4,\n",
       " 'vector_type': 'dense'}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "\n",
    "def is_fresh(index):\n",
    "    stats = index.describe_index_stats()\n",
    "    vector_count = stats.total_vector_count\n",
    "    return vector_count > 0\n",
    "\n",
    "\n",
    "while not is_fresh(index):\n",
    "    # It takes a few moments for vectors we just upserted\n",
    "    # to become available for querying\n",
    "    time.sleep(5)\n",
    "\n",
    "# View index stats\n",
    "index.describe_index_stats()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "large-tunisia",
   "metadata": {
    "id": "large-tunisia",
    "papermill": {
     "duration": 0.033445,
     "end_time": "2021-04-16T15:12:44.537290",
     "exception": false,
     "start_time": "2021-04-16T15:12:44.503845",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### Fetch a vector\n",
    "\n",
    "Again, without specifying a namespace, the API will return results from the default namespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d1855c4f",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "d1855c4f",
    "outputId": "efef2d73-f741-4446-93cd-3b04a6656dd8"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "FetchResponse(namespace='', vectors={'F-1': Vector(id='F-1', values=[1.0, 1.0], metadata={'category': 'finance', 'published': 2015.0}, sparse_values=None)}, usage={'read_units': 1})"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "index.fetch(ids=[\"F-1\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "eight-sixth",
   "metadata": {
    "id": "eight-sixth",
    "papermill": {
     "duration": 0.032846,
     "end_time": "2021-04-16T15:12:44.963008",
     "exception": false,
     "start_time": "2021-04-16T15:12:44.930162",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### Query top-3 without filtering\n",
    "\n",
    "The `top_k` param is used to specify how many query results we would like returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "timely-allen",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "timely-allen",
    "outputId": "062ebdce-5ffb-4995-be61-711c358b3fe0",
    "papermill": {
     "duration": 0.152593,
     "end_time": "2021-04-16T15:12:45.148326",
     "exception": false,
     "start_time": "2021-04-16T15:12:44.995733",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'matches': [{'id': 'F-1',\n",
       "              'metadata': {'category': 'finance', 'published': 2015.0},\n",
       "              'score': 0.0,\n",
       "              'values': []},\n",
       "             {'id': 'F-2',\n",
       "              'metadata': {'category': 'finance', 'published': 2016.0},\n",
       "              'score': 1.99999905,\n",
       "              'values': []},\n",
       "             {'id': 'S-1',\n",
       "              'metadata': {'category': 'sport', 'published': 2017.0},\n",
       "              'score': 7.99999809,\n",
       "              'values': []}],\n",
       " 'namespace': '',\n",
       " 'usage': {'read_units': 1}}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_results = index.query(\n",
    "    vector=df[df.id == \"F-1\"].vector[0], top_k=3, include_metadata=True\n",
    ")\n",
    "query_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "arabic-shooting",
   "metadata": {
    "id": "arabic-shooting",
    "papermill": {
     "duration": 0.034931,
     "end_time": "2021-04-16T15:12:45.223865",
     "exception": false,
     "start_time": "2021-04-16T15:12:45.188934",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### Query results with articles in finance published after 2015\n",
    "\n",
    "By passing a `filter` condition, we can limit the matches to those matching specific criteria in addition to vector similarity. See [Understanding Metadata](https://docs.pinecone.io/guides/data/understanding-metadata) for more information about available filter conditions.  \n",
    "\n",
    "Even though we requeusted up to 3 results with `top_k=3`, we should expect to see only 1 article that matches this query due to the metadata filter applied."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "stuck-hardware",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "stuck-hardware",
    "outputId": "a5f79346-0499-4cde-9af9-04e59158a582",
    "papermill": {
     "duration": 0.151954,
     "end_time": "2021-04-16T15:12:45.412130",
     "exception": false,
     "start_time": "2021-04-16T15:12:45.260176",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'matches': [{'id': 'F-2',\n",
       "              'metadata': {'category': 'finance', 'published': 2016.0},\n",
       "              'score': 1.99999905,\n",
       "              'values': []}],\n",
       " 'namespace': '',\n",
       " 'usage': {'read_units': 1}}"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_results = index.query(\n",
    "    vector=df[df.id == \"F-1\"].vector[0],\n",
    "    top_k=3,\n",
    "    filter={\"category\": {\"$eq\": \"finance\"}, \"published\": {\"$gt\": 2015}},\n",
    "    include_metadata=True,\n",
    ")\n",
    "query_results"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "geological-competition",
   "metadata": {
    "id": "geological-competition",
    "papermill": {
     "duration": 0.035328,
     "end_time": "2021-04-16T15:12:45.490265",
     "exception": false,
     "start_time": "2021-04-16T15:12:45.454937",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### Delete the index\n",
    "\n",
    "Once we're done with this demo we don't need the index anymore, so let's delete it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "valuable-rehabilitation",
   "metadata": {
    "id": "valuable-rehabilitation",
    "papermill": {
     "duration": 12.613954,
     "end_time": "2021-04-16T15:12:58.139886",
     "exception": false,
     "start_time": "2021-04-16T15:12:45.525932",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Delete the index\n",
    "pc.delete_index(name=index_name)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 51.37705,
   "end_time": "2021-04-16T15:12:58.702015",
   "environment_variables": {},
   "exception": null,
   "input_path": "/notebooks/quick_tour/namespacing.ipynb",
   "output_path": "/notebooks/tmp/quick_tour/namespacing.ipynb",
   "parameters": {},
   "start_time": "2021-04-16T15:12:07.324965",
   "version": "2.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
